Record: 4096-Vocab + 4.0-MLP-mult + 0.085-WD + Simplifications — val_bpb 1.09785 (3-seed mean)#1218
Awesome results @clarkkev!
Strip complexity, bigger model, higher weight decay:
- MLP 3x → 4x (32.2M params vs 27M)
- MUON_WD 0.04 → 0.085 (better int6 compression)
- ADAM_WD 0.04 → 0.02 (scalars)
- BigramHash removed, VE removed
- QK_GAIN_INIT=4.0

PR openai#1218 proved this approach works: simplify + regularize = 1.098 on sp4096. On Scylla (998 tokens): should fit ~15.9MB at high WD.
so elegant, thing of beauty my friend.
I am not the author of this PR, but I have spent enough time on the same quantization-and-compression pipeline to have views on the moving parts. Mostly I think the discovery itself is the boring part, and the mechanism is the interesting part. The boring part is: you just instrument the exact export path, matrix by matrix. For each tensor, log raw MB, quantized-and-compressed MB, and a few simple statistics on the float weights, then scatter-plot compressed size against those statistics. RMS is the statistic that jumps out.

The reason it jumps out is that, in this pipeline, RMS sits upstream of almost everything the exporter cares about. GPTQ is using rowwise scales, so if a matrix has lower RMS, its rows usually get smaller scales and therefore a finer effective grid. That pushes more coefficients into small quantized values, especially once the pruning of small codes kicks in.

I suspect the reason the R² got so absurdly high is that, in that regime, the matrices were fairly self-similar apart from scale. If the row max/RMS ratio and the general histogram shape do not vary too much, then one scalar, RMS, ends up predicting most of the row-scale distribution seen by the quantizer, and from there a lot of the final compressed size is basically determined.

I would not overstate it as a universal law, though. It is a very strong empirical regularity for this specific setup: rowwise low-bit quantization, pruning of small codes, byte shuffle, then Brotli. It can break if two matrices have the same RMS but different outlier structure, different max/RMS ratios, different scale distributions, different bit assignments, or if one tensor bypasses the low-bit path.

In short, RMS is an excellent first-order proxy for the rate term of the deployed artifact. The deeper point, to me, is that this is really a rate-distortion problem. RMS tells you a lot about the rate side, meaning how many bytes the matrix will want after quantization and compression. It does not tell you much about the distortion side, meaning how much loss you incur if you make that matrix smaller.
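The instrumentation loop described above can be sketched end to end. This is not the competition's actual exporter: the rowwise max scaling, the absolute prune threshold, and zlib standing in for Brotli are all assumptions of mine, chosen so the RMS-vs-size correlation is visible on synthetic self-similar matrices:

```python
import zlib
import numpy as np

def rowwise_quantize(w, bits=6):
    """Symmetric rowwise quantization: one scale per row, derived from the row max."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0
    codes = np.clip(np.rint(w / scales), -qmax, qmax).astype(np.int8)
    return codes, scales

def export_bytes(w, bits=6, prune_threshold=0.004):
    """Quantize, prune codes whose dequantized magnitude is tiny, then compress.
    The absolute prune threshold is my guess at the 'pruning of small codes' stage."""
    codes, scales = rowwise_quantize(w, bits)
    codes[np.abs(codes * scales) < prune_threshold] = 0
    return len(zlib.compress(codes.tobytes(), 9))

rng = np.random.default_rng(0)
# Self-similar matrices differing mainly in scale (hence RMS):
# lower RMS -> more codes fall under the prune threshold -> fewer bytes.
mats = [rng.normal(0.0, s, size=(256, 256)) for s in (0.005, 0.01, 0.015, 0.02, 0.025)]
rms = np.array([float(np.sqrt(np.mean(m ** 2))) for m in mats])
size = np.array([export_bytes(m) for m in mats], dtype=float)
print(np.corrcoef(rms, size)[0, 1])  # strongly positive in this regime
```

On matrices that differ only in scale, the correlation is strong; break the self-similarity (outliers, different max/RMS ratios) and it degrades, as argued above.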
I have seen cases where a matrix looks benign by raw MSE but is catastrophic by Hessian-weighted error. What I would recommend is to use weight decay as a crude shadow price on bytes, then spend the saved bytes on the matrices whose Hessian-weighted quantization damage is worst.
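A tiny numerical illustration of that benign-by-MSE, catastrophic-by-Hessian failure mode, using a diagonal Hessian proxy; the curvature values and noise model are made up for the demo:

```python
import numpy as np

def hessian_weighted_error(delta, h_diag):
    """Second-order estimate of loss increase from weight perturbation delta:
    0.5 * sum_i h_i * delta_i^2, with h_diag a diagonal Hessian proxy
    (e.g. a Fisher / squared-gradient estimate)."""
    return 0.5 * float(np.sum(h_diag * delta ** 2))

rng = np.random.default_rng(0)
n = 10_000
h = np.full(n, 0.1)
h[:100] = 91.0                     # a few highly sensitive coordinates

err_a = rng.normal(0.0, 1e-3, n)   # quantization noise spread evenly
err_b = err_a.copy()
err_b[:100] *= 5.0                 # noise concentrated on the sensitive coordinates
err_b *= np.sqrt(np.mean(err_a ** 2) / np.mean(err_b ** 2))  # match raw MSE exactly

mse_a, mse_b = float(np.mean(err_a ** 2)), float(np.mean(err_b ** 2))
hw_a = hessian_weighted_error(err_a, h)
hw_b = hessian_weighted_error(err_b, h)
# mse_a == mse_b to numerical precision, yet hw_b is many times larger than hw_a.
```

Two error patterns with identical raw MSE can differ by an order of magnitude in estimated loss damage, which is exactly why bit budgets should follow Hessian-weighted error rather than MSE.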
The simplifications make sense in the local logic of that PR, though not all for the same reason. Once you have the compression lever via RMS and a 4096 vocab, removing lexical auxiliaries like hash embeddings and the smear gate becomes much easier to justify, because some of what they were buying is now being bought more directly by the tokenizer and by the larger core model. Likewise, if stronger regularization is making the main weight banks cheaper to quantize and compress, then spending the recovered budget on a wider MLP is a very coherent move.

At the same time, I do not think all of those removals should be read as settled truths. Some of them, especially the lexical extras, follow pretty naturally from the larger vocab; others feel more like strong empirical bets that hand-wave away interactions I suspect matter, particularly around matrix-specific quantization sensitivity and export behavior. My own not-yet-submitted SOTA PR has led me to somewhat different conclusions on a few of these choices. The space of good recipes is larger than any single record PR might suggest.

The framing I keep coming back to is that Parameter Golf is not about training a model and then compressing it, although so far most PRs read like that is exactly what it is. It is about learning an equivalence class of functions, then choosing the member of that class whose quantized, side-informed, bank-packed serialization has the lowest task loss at 16MB. Or, in fewer syllables: the true parameters are the bits. Train weights that the quantizer will thank you for.

This submission is what it looks like when someone starts pulling on that thread, and there is a lot more to find in this direction. Go further: shape the training dynamics from the ground up so the learned solution already lives in a compression-friendly, distortion-stable basin. You can push a surprisingly large model through the 16MB bottleneck and have it come out the other side intact.
…1.0929 (3-seed mean)

Adds three techniques to PR openai#1218's 4096-vocab high-WD stack:
- MuonEq-R optimizer (row-norm before NS5 orthogonalization)
- Depth recurrence on layers 4,5 (shared MLP, zero extra params)
- Mixed int5/int6 GPTQ via Hessian sensitivity ranking

3-seed mean: 1.0929 BPB / 2.5145 nats
All seeds under 16MB (max: 15,981,324 bytes)
No TTT, no SLOT, no eval-time adaptation.
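The mixed int5/int6 assignment reads like a greedy knapsack over per-matrix sensitivities. That PR does not spell out its procedure, so the following is one plausible reading with hypothetical byte counts and sensitivity scores:

```python
def assign_bits(sizes5, sizes6, sensitivity, budget):
    """Start every matrix at int5, then upgrade to int6 in order of
    decreasing Hessian sensitivity while the total stays under `budget` bytes."""
    bits = {name: 5 for name in sizes5}
    total = sum(sizes5.values())
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        extra = sizes6[name] - sizes5[name]
        if total + extra <= budget:
            bits[name] = 6
            total += extra
    return bits, total

# Toy, hypothetical numbers: compressed bytes at each width plus a
# per-matrix sensitivity score (e.g. Hessian-weighted quantization error).
sizes5 = {"attn": 100, "mlp": 200, "emb": 300}
sizes6 = {"attn": 120, "mlp": 240, "emb": 360}
sens = {"attn": 0.9, "mlp": 0.5, "emb": 0.1}
bits, total = assign_bits(sizes5, sizes6, sens, budget=670)
print(bits, total)  # attn and mlp get int6, emb stays int5, total 660 bytes
```

The design choice is the usual rate-distortion one: spend the extra bit only where the distortion side (sensitivity), not the rate side, says it pays.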
Record: 4096-Vocab + Larger Model + High WD + Simplifications — val_bpb 1.09785
val bpb: 1.09785 (3-seed mean, std=0.0004)
Overview
This script builds on the 03-23 leaderboard record. The main changes are:
Fixes
The window-start computation was:

`window_starts = [ws for ws in range(0, total_tokens, stride) if min(ws + seq_len, total_tokens) - ws >= 1]`

and it should be:

`window_starts = [ws for ws in range(0, total_tokens, stride) if ws + seq_len - stride < total_tokens]`

Simplifications
Additions
- Added `data/download_hf_docs_and_tokenize.py` to build the sentencepiece tokenizer and pre-tokenized data. The tokenizer model grew by ~50kb, but even with that added, the final artifacts would be below the 16MB cap. A larger vocab means the model sees more context for the same sequence length and more train data per step.
- Each matrix's compressed size is well predicted by its RMS (`torch.sqrt(torch.mean(x**2))`), with an R^2 near 0.99. This suggests that weight decay is a good lever for reducing the compressed size, which can let us add more parameters to the model. In particular this script uses:
  - `mlp_mult` 3 -> 4.
  - `qk_gain_init` 1.5 -> 4, following #1125.
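The RMS-vs-size regression behind that R^2 figure takes only a few lines to reproduce once per-matrix measurements are logged. The `rms` and `mb` arrays below are illustrative placeholders, not measurements from this script:

```python
import numpy as np

# Hypothetical per-matrix log from the export path: weight RMS vs compressed MB.
rms = np.array([0.010, 0.014, 0.018, 0.025, 0.031])
mb = np.array([1.1, 1.6, 2.0, 2.9, 3.6])  # illustrative, not measured values

# Ordinary least-squares fit and its R^2.
slope, intercept = np.polyfit(rms, mb, 1)
pred = slope * rms + intercept
r2 = 1.0 - np.sum((mb - pred) ** 2) / np.sum((mb - mb.mean()) ** 2)

# With a fit like this, if raising weight decay lowers a matrix's RMS by d,
# the model predicts roughly slope * d MB of freed budget for that matrix.
print(slope, r2)
```

A fit of this quality is what justifies treating weight decay as a byte-budget lever: freed megabytes can be reinvested in `mlp_mult` and similar capacity knobs.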